Text Clustering Using Cosine Similarity and Matrix Factorization

Authors

  • R. Umamaheswari
  • K. Rajesh
Abstract

Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Text clustering divides a collection of text documents into categories so that documents in the same category describe the same topic, such as classical music. It groups documents with similar content into the same cluster, with the similarity between objects measured by a similarity function. Hierarchical clustering schemes can be used effectively for processing large datasets. This paper proposes using the hierarchical clustering technique called the "sub leader algorithm" together with cosine similarity to cluster the documents.

Key words: Similarity measures, text clustering, data
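To make the similarity step concrete, the following is a minimal sketch, not the paper's implementation. It computes cosine similarity between bag-of-words term-frequency vectors and uses it in a plain single-pass leader clustering, where a document joins the first leader it is similar enough to and otherwise becomes a new leader. The sub-leader refinement named in the abstract is not reproduced here; the function names, the whitespace tokenizer, and the threshold value of 0.3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector from a lowercased, whitespace-split text (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def leader_clustering(docs, threshold=0.3):
    """Single-pass leader clustering (illustrative; threshold is an assumed value).

    Returns a dict mapping each leader's document index to the list of
    document indices assigned to that leader.
    """
    vectors = [tf_vector(d) for d in docs]
    clusters = {}  # leader index -> member indices, in insertion order
    for i, vec in enumerate(vectors):
        for leader in clusters:
            if cosine_similarity(vec, vectors[leader]) >= threshold:
                clusters[leader].append(i)
                break
        else:
            # No existing leader is similar enough: this document starts a new cluster.
            clusters[i] = [i]
    return clusters


docs = [
    "classical music concert violin",
    "violin concert classical orchestra",
    "football match goal",
    "goal football league",
]
print(leader_clustering(docs))
```

With these four toy documents, the two music texts share three of four terms and the two football texts share two of three, so each pair ends up under one leader. The result depends on document order and on the threshold, which is a known property of leader-style methods.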

Similar resources

GWU NLP at SemEval-2016 Shared Task 1: Matrix Factorization for Crosslingual STS

We present a matrix factorization model for learning cross-lingual representations for sentences. Using sentence-aligned corpora, the proposed model learns distributed representations by factoring the given data into language-dependent factors and one shared factor. As a result, input sentences from both languages can be mapped into fixed-length vectors and then compared directly using the cosi...

Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents

In this paper we discuss a new model for document clustering which has been adapted using non-negative matrix factorization method. The key idea is to cluster the documents after measuring the proximity of the documents with the extracted features. The extracted features are considered as the final cluster labels and clustering is done using cosine similarity which is equivalent to k-means with...

Comparison Clustering using Cosine and Fuzzy set based Similarity Measures of Text Documents

Keeping in consideration the high demand for clustering, this paper focuses on understanding and implementing K-means clustering using two different similarity measures. We have tried to cluster the documents using two different measures rather than clustering it with Euclidean distance. Also a comparison is drawn based on accuracy of clustering between fuzzy and cosine similarity measure. The ...

Nonlocal Total Variation with Primal Dual Algorithm and Stable Simplex Clustering in Unsupervised Hyperspectral Imagery Analysis

We focus on implementing a nonlocal total variational method for unsupervised classification of hyperspectral imagery. We minimize the energy directly using a primal dual algorithm, which we modified for the non-local gradient and weighted centroid recalculation. By squaring the labeling function in the fidelity term before re-calculating the cluster centroids, we can implement an unsupervised ...

Text Document Clustering based on Phrase

Affinity propagation (AP) was recently introduced as an unsupervised learning algorithm for exemplar-based clustering. In this paper, a novel text document clustering algorithm has been developed based on the vector space model, phrases, and the affinity propagation clustering algorithm. The proposed algorithm can be called Phrase Affinity Clustering (PAC). PAC first finds the phrase by Ukkonen suffix tree con...


Publication date: 2014